NLG metric
PolyPath: Adapting a Large Multimodal Model for Multi-slide Pathology Report Generation
Faruk Ahmed, Lin Yang, Tiam Jaroensri, Andrew Sellergren, Yossi Matias, Avinatan Hassidim, Greg S. Corrado, Dale R. Webster, Shravya Shetty, Shruthi Prabhakara, Yun Liu, Daniel Golden, Ellery Wulczyn, David F. Steiner
Recent applications of vision-language modeling in digital histopathology have been predominantly designed to generate text describing individual regions of interest extracted from a single digitized histopathology image, or Whole Slide Image (WSI). An emerging line of research approaches the more practical clinical use case of slide-level text generation (Ahmed et al., 2024; Chen et al., 2024). However, in the typical clinical use case, there can be multiple biological tissue parts associated with a case, with each part having multiple slides. Pathologists write up a report summarizing their part-level diagnostic findings by microscopically reviewing each of the available slides per part and integrating information across these slides. This many-to-one relationship of slides to clinical descriptions is a recognized challenge for vision-language modeling in this space (Ahmed et al., 2024). The common approach taken in recent literature is to restrict modeling and analysis to single-slide cases or to manually identify a single slide within a case or part that is most representative of the clinical findings in reports (Ahmed et al., 2024; Chen et al., 2024; Guo et al., 2024; Shaikovski et al., 2024; Xu et al., 2024; Zhou et al., 2024). This strategy of selecting representative slides was also adopted in constructing one of the most widely used histopathology datasets, TCGA (Cooper et al., 2018).
- Research Report > Experimental Study (0.70)
- Research Report > New Finding (0.46)
- Health & Medicine > Therapeutic Area > Oncology (1.00)
- Health & Medicine > Diagnostic Medicine (1.00)
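To make the many-to-one structure described in the PolyPath abstract concrete, the following is a minimal, hypothetical sketch of how a case might be organized for part-level report generation. The data layout and the `generate_part_description` callable are illustrative placeholders, not the PolyPath implementation.

```python
from dataclasses import dataclass, field

@dataclass
class Part:
    part_id: str
    slide_paths: list[str] = field(default_factory=list)  # several WSIs per part

@dataclass
class Case:
    case_id: str
    parts: list[Part] = field(default_factory=list)  # several parts per case

def generate_report(case: Case, generate_part_description) -> str:
    """One description per part, produced from *all* slides of that part."""
    lines = []
    for part in case.parts:
        # Hypothetical multimodal call that consumes every slide of the part
        # jointly; this is a placeholder, not the PolyPath API.
        text = generate_part_description(part.slide_paths)
        lines.append(f"Part {part.part_id}: {text}")
    return "\n".join(lines)
```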
Can We Trust the Performance Evaluation of Uncertainty Estimation Methods in Text Summarization?
Jianfeng He, Runing Yang, Linlin Yu, Changbin Li, Ruoxi Jia, Feng Chen, Ming Jin, Chang-Tien Lu
Text summarization, a key natural language generation (NLG) task, is vital in various domains. However, the high cost of inaccurate summaries in risk-critical applications, particularly those involving human-in-the-loop decision-making, raises concerns about the reliability of uncertainty estimation on text summarization (UE-TS) evaluation methods. This concern stems from the dependency of uncertainty model metrics on diverse and potentially conflicting NLG metrics. To address this issue, we introduce a comprehensive UE-TS benchmark incorporating 31 NLG metrics across four dimensions. The benchmark evaluates the uncertainty estimation capabilities of two large language models and one pre-trained language model on three datasets, with human-annotation analysis incorporated where applicable. We also assess the performance of 14 common uncertainty estimation methods within this benchmark. Our findings emphasize the importance of considering multiple uncorrelated NLG metrics and diverse uncertainty estimation methods to ensure reliable and efficient evaluation of UE-TS techniques.
- North America > United States > Virginia > Falls Church (0.04)
- North America > United States > Texas > Dallas County > Richardson (0.04)
- North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)
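As a rough illustration of the evaluation concern raised in the abstract above: a UE-TS method is typically judged by how well its per-summary uncertainty scores track NLG quality metrics, and different metrics can disagree. The sketch below, with made-up scores and illustrative metric names, shows one plausible way to check this per metric; it is not the benchmark's actual protocol.

```python
from scipy.stats import spearmanr

# One uncertainty score per generated summary (higher = less confident).
uncertainty_scores = [0.8, 0.2, 0.5, 0.9]

# Quality scores for the same summaries under several NLG metrics (toy values).
metric_scores = {
    "rouge_l": [0.31, 0.62, 0.45, 0.28],
    "bertscore": [0.85, 0.93, 0.90, 0.82],
    "bartscore": [-2.1, -1.2, -1.6, -2.4],
}

# Reliable uncertainty should rank low-quality summaries as more uncertain,
# so we expect a negative rank correlation with each quality metric.
for name, quality in metric_scores.items():
    rho, _ = spearmanr(uncertainty_scores, quality)
    print(f"{name}: Spearman rho = {rho:.2f}")
```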
Is ChatGPT a Good NLG Evaluator? A Preliminary Study
Jiaan Wang, Yunlong Liang, Fandong Meng, Zengkui Sun, Haoxiang Shi, Zhixu Li, Jinan Xu, Jianfeng Qu, Jie Zhou
Recently, the emergence of ChatGPT has attracted wide attention from the computational linguistics community. Many prior studies have shown that ChatGPT achieves remarkable performance on various NLP tasks in terms of automatic evaluation metrics. However, the ability of ChatGPT to serve as an evaluation metric is still underexplored. Considering that assessing the quality of natural language generation (NLG) models is an arduous task and that NLG metrics notoriously correlate poorly with human judgments, we ask whether ChatGPT is a good NLG evaluation metric. In this report, we provide a preliminary meta-evaluation of ChatGPT to assess its reliability as an NLG metric. In detail, we regard ChatGPT as a human evaluator and give task-specific (e.g., summarization) and aspect-specific (e.g., relevance) instructions to prompt ChatGPT to evaluate the outputs of NLG models. We conduct experiments on five NLG meta-evaluation datasets (covering summarization, story generation, and data-to-text tasks). Experimental results show that, compared with previous automatic metrics, ChatGPT achieves state-of-the-art or competitive correlation with human judgments in most cases. In addition, we find that the effectiveness of the ChatGPT evaluator may depend on how the meta-evaluation datasets were created: for datasets whose construction relies heavily on the references, and which are therefore biased, the ChatGPT evaluator may lose its effectiveness. We hope our preliminary study can prompt the emergence of a general-purpose, reliable NLG metric.
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.14)
- Asia > China > Beijing > Beijing (0.04)
- (14 more...)
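The sketch below illustrates the general recipe the abstract describes: prompt an LLM with a task-specific and aspect-specific instruction to score each output, then meta-evaluate by correlating those scores with human judgments. The prompt wording and the `score_fn` hook are hypothetical stand-ins, not the paper's exact setup.

```python
from scipy.stats import spearmanr

def build_prompt(source: str, summary: str, aspect: str = "relevance") -> str:
    # Task-specific (summarization) and aspect-specific (relevance) instruction.
    return (
        f"Score the following summary for {aspect} on a scale from 1 to 5.\n\n"
        f"Source article:\n{source}\n\nSummary:\n{summary}\n\nScore:"
    )

def meta_evaluate(sources, summaries, human_scores, score_fn):
    # score_fn(prompt) -> float should call the LLM of your choice and parse
    # the numeric rating from its reply (placeholder, not the paper's code).
    model_scores = [score_fn(build_prompt(s, y)) for s, y in zip(sources, summaries)]
    # Meta-evaluation: rank correlation between LLM scores and human judgments.
    rho, _ = spearmanr(model_scores, human_scores)
    return rho
```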
Improving the Factual Correctness of Radiology Report Generation with Semantic Rewards
Jean-Benoit Delbrouck, Pierre Chambon, Christian Bluethgen, Emily Tsai, Omar Almusa, Curtis P. Langlotz
Neural image-to-text radiology report generation systems offer the potential to improve radiology reporting by reducing the repetitive process of report drafting and identifying possible medical errors. These systems have achieved promising performance as measured by widely used NLG metrics such as BLEU and CIDEr. However, current systems face important limitations. First, they introduce increased architectural complexity that offers only marginal improvements on NLG metrics. Second, systems that achieve high performance on these metrics are not always factually complete or consistent, owing to inadequacies in both training and evaluation. Recent studies have shown that these systems can be substantially improved by new methods that encourage 1) generating domain entities consistent with the reference and 2) describing these entities in inferentially consistent ways. So far, these methods rely on weakly supervised (rule-based) approaches and named entity recognition systems that are not specific to the chest X-ray domain. To overcome this limitation, we propose a new method, the RadGraph reward, to further improve the factual completeness and correctness of generated radiology reports. More precisely, we leverage the RadGraph dataset, which contains chest X-ray reports annotated with entities and relations between entities. On two open radiology report datasets, our system substantially improves scores, by up to 14.2% and 25.3%, on metrics evaluating the factual correctness and completeness of reports.
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- North America > United States > Indiana (0.04)
- Oceania > Australia > Victoria > Melbourne (0.04)
- (2 more...)
- Health & Medicine > Nuclear Medicine (1.00)
- Health & Medicine > Diagnostic Medicine > Imaging (1.00)
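A simplified sketch of the idea behind an entity/relation-based reward such as the RadGraph reward described above: extract entities and relations from the generated and reference reports and reward their overlap. The extraction step is stubbed out with toy sets here; the actual method uses a parser trained on RadGraph annotations and a specific reward formulation that may differ from this F1-style combination.

```python
def overlap_f1(generated: set, reference: set) -> float:
    """F1 between the generated and reference sets (entities or relations)."""
    if not generated or not reference:
        return 0.0
    tp = len(generated & reference)
    if tp == 0:
        return 0.0
    precision = tp / len(generated)
    recall = tp / len(reference)
    return 2 * precision * recall / (precision + recall)

# Toy example: (entity, label) pairs and (head, relation, tail) triples that a
# RadGraph-style parser might extract from the generated and reference reports.
gen_entities = {("effusion", "observation"), ("left lung", "anatomy")}
ref_entities = {("effusion", "observation"), ("right lung", "anatomy")}
gen_relations = {("effusion", "located_at", "left lung")}
ref_relations = {("effusion", "located_at", "right lung")}

# Combine entity-level and relation-level agreement into a single reward.
reward = 0.5 * overlap_f1(gen_entities, ref_entities) + \
         0.5 * overlap_f1(gen_relations, ref_relations)
print(f"RadGraph-style reward: {reward:.2f}")
```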
Explaining Chest X-ray Pathologies in Natural Language
Maxime Kayser, Cornelius Emde, Oana-Maria Camburu, Guy Parsons, Bartlomiej Papiez, Thomas Lukasiewicz
Most deep learning algorithms lack explanations for their predictions, which limits their deployment in clinical practice. Approaches to improve explainability, especially in medical imaging, have often been shown to convey limited information, be overly reassuring, or lack robustness. In this work, we introduce the task of generating natural language explanations (NLEs) to justify predictions made on medical images. NLEs are human-friendly and comprehensive, and enable the training of intrinsically explainable models. To this end, we introduce MIMIC-NLE, the first large-scale medical imaging dataset with NLEs. It contains over 38,000 NLEs, which explain the presence of various thoracic pathologies and chest X-ray findings. We propose a general approach to solve the task and evaluate several architectures on this dataset, including via clinician assessment.
- Europe > United Kingdom > England > Oxfordshire > Oxford (0.14)
- Europe > United Kingdom > England > Greater London > London (0.04)
- Europe > Austria (0.04)
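For orientation, an NLE-style example in a dataset like the one described above pairs an image and its findings with a free-text explanation. The field names below are hypothetical and do not reflect MIMIC-NLE's actual schema.

```python
from dataclasses import dataclass

@dataclass
class NLEExample:
    image_path: str      # chest X-ray image
    findings: list[str]  # e.g., ["pleural effusion", "atelectasis"]
    explanation: str     # natural language explanation justifying the findings

example = NLEExample(
    image_path="images/patient_001_view1.png",
    findings=["pleural effusion"],
    explanation="Blunting of the left costophrenic angle suggests a small pleural effusion.",
)
```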
Jury: Evaluating performance of NLG models
Jury is an evaluation package for NLG systems that allows computing many metrics in one go. It implements concurrency across evaluation metrics and supports evaluation with multiple predictions. Jury uses the datasets package for its metrics, and thus supports any metric that the datasets package provides. The default evaluation metrics are BLEU, METEOR, and ROUGE-L.
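A minimal usage sketch based on the description above; the exact API (the `Jury` class and its call signature) should be checked against the package documentation, as it may differ across versions.

```python
from jury import Jury

predictions = ["the cat sat on the mat", "it is a sunny day"]
references = ["the cat is sitting on the mat", "today is a sunny day"]

# With no arguments, Jury falls back to its default metrics
# (BLEU, METEOR, and ROUGE-L, per the description above).
scorer = Jury()
scores = scorer(predictions=predictions, references=references)
print(scores)
```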